Top Manhattan Restaurant Analysis

1 Introduction

The goal of our project was to establish a link between hygiene inspections and the ratings of restaurants. The idea was simple: we found a database of hygiene inspections for restaurants in Manhattan, and we decided to scrape reviews from TripAdvisor to see whether we could link the two.

Our assumption was that certain types of restaurant would see their ratings affected differently by a bad hygiene inspection. We expected this to depend mainly on location, price, and type of cuisine. A cheap restaurant should be less affected than an expensive one: people who are not paying much presumably care less, whereas people who pay a lot to eat somewhere can reasonably expect high quality and irreproachable hygiene. We believed cuisine type and location would play a similar role, since in the US some neighborhoods are poorer than others and some types of cuisine are perceived as cheaper or fancier.

We now present the process that led to our findings. First we describe the methods we used and present them along with our R code. Finally we present our findings and conclude.

2 Methodology

In this project we used quite a few concepts and methods that were new to us, so it seems worthwhile to have a methodology section that briefly presents them.

The first thing we needed to do was scrape the data from TripAdvisor. This required understanding a little of how HTML and CSS work, but thanks to our web browser's developer tools we could simply highlight the object we wanted, for instance the restaurant name, and the tools would show the relevant div (a section of the HTML code). We then needed R to extract the information we wanted from the pages we wanted. For this we used the RSelenium package together with Docker: Docker runs a browser in a container, which R then drives through Selenium commands, letting us retrieve the HTML, and thus the information we wanted, from each page. Note that each time we scraped we had to start the Docker server from the terminal; we come back to the problems and technical details later. We used this process because the TripAdvisor API did not allow us to register, since we are not a business.

After that we found ourselves with two databases. The one we "created" was really messy and needed a lot of cleaning. Beyond basic transformations, such as putting the dates in date format and computing a running-average column for the rating, the most interesting tools we used in this part were probably the Google API and the nest function from the tidyverse. The nest function lets you put data frames inside a data frame (technically, a list-column of data frames, but the idea is there). This was very helpful because we had one line per restaurant but hundreds of reviews for each. So we nested the two data frames and joined them, which was no easy task because of differences in the names. At this point we wanted to plot the scraped restaurants on a map. We had the addresses, but to plot them we needed geographic coordinates. The geocode function retrieves this information from Google, but when we first collected the data some coordinates were randomly missing. To solve this we registered a Google API key, which grants a quota of requests to the Google API; it worked like a charm. We also needed the devtools package to install a development version of ggmap whose register_google function lets us register the API key. After all these steps our data were finally ready to work on.
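As a minimal illustration of what nest does (on made-up toy data, not our actual scrape):

```r
library(tidyverse)

# Toy stand-in for our review data: one row per review, several per restaurant
reviews <- tibble(
  name = c("A", "A", "B"),
  rate = c(5, 4, 3)
)

# nest() collapses this to one row per restaurant; each restaurant's reviews
# end up inside a list-column of data frames named "data"
nested <- reviews %>% group_by(name) %>% nest()
```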

At this point it was time to explore the data and plot some graphs and maps. First we made some basic plots, nothing fancy, but they let us see clearly what we had. Then we used the leaflet package to plot the restaurants on an interactive map where we could display them by category or hide some of them. The plotly package let us do the same for the time-series plots (average rating vs. date). For the time series we also needed to pad the series, since obviously reviews do not arrive on a regular basis.
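The idea behind padding, which the padr package automates with pad(), can be sketched in base R (the dates below are made up for illustration):

```r
# Two reviews three days apart; plotting needs the empty days in between
ts <- data.frame(date = as.Date(c("2017-01-01", "2017-01-04")), avg = c(4, 4.5))

# Build the full daily sequence and left-join onto it: missing days get NA
all_days <- data.frame(date = seq(min(ts$date), max(ts$date), by = "day"))
ts_padded <- merge(all_days, ts, all.x = TRUE)
```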

In the end we decided to present our project as a website, hosted on GitHub Pages. We thus had to learn a bit more HTML and CSS and to use Jekyll, which let us serve the website locally while building it, which was more convenient. Jekyll, too, is installed and run from the terminal.

We should also mention that, because of the scraping, we were forced to drop some restaurants during cleaning when data were missing. In the end we had complete data for a bit more than 300 restaurants.

We used quite a lot of functions, packages, and concepts, which may seem a little messy; the code following this section goes into more detail, with comments on what each command does.

2.1 Packages

These are all the packages we used:

library(rvest)
library(RSelenium)
library(wdman)
library(XML)
library(plyr)
library(tidyverse)
library(ggmap)
library(lubridate) 
library(leaflet)## to plot interactive maps
library(devtools)## to install new ggmap package and use google api
library(knitr)
library(padr)## to complete the data missing in the time series
library(plotly)## to have interactive plots
library(broom)
library(kableExtra)

3 Scrape

3.1 Connecting R to the Docker server

To execute this code you first need to install Docker on your computer and start the Selenium container from the terminal with a docker run command. After that you can access the emulated web browser with the following commands:
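For reference, the docker run invocation typically looks like this (the image name and port mapping follow the RSelenium documentation; adjust if your setup differs):

```shell
# start a standalone Chrome container, mapping host port 4445 to Selenium's 4444
docker run -d -p 4445:4444 selenium/standalone-chrome
```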

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, 
                      browserName = "chrome")
remDr$open()
remDr$navigate("http://www.google.com") # navigate to the Google web page
remDr$getTitle() # If it returns "Google", the server is operational

3.3 Scraping the data

3.3.1 Basic information about the restaurants

The first thing we did was scrape the name, the type of cuisine and price range, and the page URL of every restaurant we could. The problem was that at some point we hit the query limit, so we could not take them all, and some data were partly wrong because RSelenium has no ad blocker, so we also picked up data from advertisements. Also, since TripAdvisor's restaurant listing is built around an interactive box, we needed to click to the next page every 30 restaurants. With the following code we were nevertheless able to scrape around 1,800 restaurant names and web pages.

# Creation of the data frame that will store the scraped data
df <- data.frame(name = character(), cuisine = character(), page_url = character(), stringsAsFactors = FALSE)

#scrape
pb = txtProgressBar(min = 0, max = 1, style = 3) #Creation of a progress bar to visualize the progression of the for loop
for(i in 1 : 100 ){
  
  overview <- remDr$findElement(using="css", value="[class='deckA eatery_overview']") #
  
  #read the name
  name.elems <- overview$findChildElements(using="css", value="[class='property_title']")
  
  for(j in 1:length(name.elems)){
    name.elem <- name.elems [[j]]
    df[j+i*length(name.elems)-length(name.elems),1] <- name.elem$getElementText()
  }
  
  ## read the cuisines attribute
  cuisine.elems <- overview$findChildElements(using="css", value="[class='cuisines']")
  
  for(k in 1:length(cuisine.elems)){
    cuisine.elem <- cuisine.elems [[k]]
    df[k+i*length(cuisine.elems)-length(cuisine.elems),2] <- cuisine.elem$getElementText()
  }
  
  ## read the page url of the restaurant
  url.elems <- overview$findChildElements(using="css", value="[class='property_title']")
  
  for(o in 1:length(url.elems)){
    url.elem <- url.elems [[o]]
    rhs <- sub(".*' *(.*?) *'.*", "\\1", url.elem$getElementAttribute("onclick"))
    df[o+i*length(url.elems)-length(url.elems),3] <- str_c("https://www.tripadvisor.com", rhs, sep = "")
  }
  ## go to next page 
  webElems <- remDr$findElement(using = 'css', value="#EATERY_LIST_CONTENTS > div.deckTools.btm > div > a.nav.next.rndBtn.ui_button.primary.taLnk")
  webElems$clickElement()
  Sys.sleep(2)
  
  setTxtProgressBar(pb,i)
# pause for a second every 10 pages to try to stay under the query limit
  if (i %% 10 == 0 ){Sys.sleep(1)}
}
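The row index j + i*length(name.elems) - length(name.elems) used above is just j + (i-1)*n, i.e., position j within the i-th block of n results; a quick check:

```r
n <- 30          # restaurants per TripAdvisor results page
i <- 3           # third page
j <- 2           # second restaurant on that page

row <- j + i*n - n   # equivalent to j + (i - 1)*n
# places the entry at overall position 62
```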

3.3.2 Scraping every review for each restaurant

Now that we had the specific web address of many restaurants, we needed to get every review, with its date, for each of them. The following code works the same way as the previous one, except that the pages we visit are given by our first scrape. Because of problems similar to those of the first scrape, we were only able to get around 500 restaurants with full reviews.

#we encountered some errors during the scrape so we remove all rows with NA values
df <- na.omit(df)
#Create new columns in the data frame to store each rating with its date, and the restaurant address
df$rate <- list(NA)
df$rate.date <- list(NA)
df <- mutate(df,address = "")
#The first row is an advertisement on the TripAdvisor page; it leads to a page URL our code can't scrape, so we assign a valid address to start
df$page_url[[1]] <- "https://www.tripadvisor.com/Restaurant_Review-g60763-d10145683-Reviews-Dominique_Bistro-New_York_City_New_York.html"

pb = txtProgressBar(min = 0, max = length(df$page_url), style = 3) #Setting up the progress bar
for(m in 1 : length(df$page_url)){ # raise the lower bound to resume an interrupted scrape
  setTxtProgressBar(pb,m)
  remDr$navigate(df$page_url[[m]])
  
  Sys.sleep(3)
  remDr$screenshot(display = TRUE)#screenshot so we can see which restaurant is currently being scraped
  remDr$getTitle()
  
  df1 <- data.frame(rating = integer(), date = character(),stringsAsFactors = FALSE)
  
  #We get the number of pages containing reviews (internal pagination of the restaurant page)
  pagenumbers <- remDr$findElement(using = 'css', value = "[class='pageNum last taLnk ']")
  pagenumbers <- as.numeric(pagenumbers$getElementAttribute("data-page-number"))
  
  #Scraping the restaurant's address
  address <- remDr$findElement(using = 'css', value = "[class='street-address']")
  address <- address$getElementText()
  
  df$address[[m]] <- address[[1]]
  
  #Scraping of each review with the related date
  pb1 = txtProgressBar(min = 0, max = pagenumbers, style = 3)
  for(q in 1:pagenumbers){
    
    list.container <- remDr$findElement(using = 'css', value = "[class='listContainer ']")
    
    rateanddate.elems <- list.container$findChildElements(using = 'css', value="[class='rating reviewItemInline']")
    
    for(w in 1: length(rateanddate.elems)){
      rateanddate <- rateanddate.elems [[w]]
      
      rating.elem <- rateanddate$findChildElement(using = 'css selector', value = "div > *:first-child")
      
      df1[w+q*10-10,1] <- rating.elem$getElementAttribute("class")
      df1 [w+q*10-10,2] <- rateanddate$getElementText()
    }
    
    if (q < pagenumbers){
      webElems <- remDr$findElement(using = 'css', value="[class='nav next taLnk ui_button primary']")
      webElems$clickElement()
      Sys.sleep(3)
    }
    setTxtProgressBar(pb1,q)
  }
  #Store the scraped values in the data frame
  df1 <- as.list(df1)
  df$rate[[m]] <- df1$rating
  df$rate.date[[m]] <- df1$date

  # pause for 3 minutes after every 5 scraped restaurants so the server doesn't crash
  if (m %% 5 == 0 ){Sys.sleep(180)}
  
  print(m)
}

3.4 Exporting the data frame

At this point it was important to save the scraped data once, before cleaning, so we would not have to scrape them again. Note that the scraping takes a lot of time.

## save data frame :
save(df,file="TA_db.Rda")

4 Data Cleaning

In the data-cleaning part we aim to simplify our database and store all the information efficiently so we can exploit it. Since we have two data sets, we are also going to join them.

4.1 Loading the scraped data

We simply load the data we scraped :

load("TA_db.Rda")
TA.db <- df # the loaded object is called df; give it the name used below

4.2 Data transformation

4.2.1 Format

First we put the scraped variables into the formats needed to exploit them. Everything that is supposed to be text is formatted as character, and the dates are stored in date format using mdy. One problem was that TripAdvisor writes the most recent reviews in a format we cannot exploit (something like "two weeks ago"). Since mdy returns NA for these values, we decided to simply delete them.

#transform strings as character
TA.db <- TA.db %>% select(-1) %>% mutate(name = as.character(name),cuisine = as.character(cuisine),page_url = as.character(page_url))

#Convert date column into date format
for(i in 1: length(TA.db$rate.date)){

TA.db$rate.date[[i]] <- mdy(TA.db$rate.date[[i]])

}
##The most recent rating dates are written in a format like "2 days ago", which can't be converted to a date, so the previous step leaves NA. We decided to remove them
###define a function that removes NAs from a list
na.omit.list <- function(y) { return(y[!sapply(y, function(x) all(is.na(x)))]) }
for (j in 1: length(TA.db$rate.date)){

 TA.db$rate[[j]] <- TA.db$rate [[j]] [!is.na(TA.db$rate.date[[j]])] 

 TA.db$rate.date[[j]] <- na.omit.list(TA.db$rate.date[[j]])

}

4.2.2 Rating transformation

The rating information is stored in TripAdvisor's HTML as a "bubble rating" class, because the site displays ratings as bubble images. We therefore needed to convert the bubble class string to a numerical value so that we could later compute averages over time.

#modify the bubble rating into an integer (1-5)
pb1 = txtProgressBar(min = 0, max = length(TA.db$rate), style = 3)

for(j in 1:length(TA.db$rate)){
  
x <- TA.db$rate [[j]]

for(i in 1 : length(x)){
x [i] <- str_replace(x[i],"ui_bubble_rating bubble_50","5")
x [i] <- str_replace(x[i],"ui_bubble_rating bubble_40","4")
x [i] <- str_replace(x[i],"ui_bubble_rating bubble_30","3")
x [i] <- str_replace(x[i],"ui_bubble_rating bubble_20","2")
x [i] <- str_replace(x[i],"ui_bubble_rating bubble_10","1")
}
x<- as.numeric(x)

TA.db$rate [[j]] <- x
setTxtProgressBar(pb1,j)
}
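The same conversion can also be written without the double loop; a small base-R sketch on hypothetical class strings of the same shape:

```r
# two example class strings as scraped from the rating bubbles
bubbles <- c("ui_bubble_rating bubble_50", "ui_bubble_rating bubble_30")

# the digits after "bubble_" are the rating times ten; keep the first one
rates <- as.numeric(sub(".*bubble_(\\d)0$", "\\1", bubbles))
```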

4.2.3 Price extraction

Since the restaurant's price range was scraped together with the type of cuisine, we need to split the strings to keep only the price. Doing so, we obviously lose the cuisine-type information, but it is also present in the other database, so we will use that one instead.

#to be sure that we catch the first capital letter
TA.db$cuisine <- str_to_upper(TA.db$cuisine)

for(i in 1: length(TA.db$cuisine)){
  
TA.db$cuisine[[i]] <- str_extract(TA.db$cuisine[[i]],"[^[A-Z]]+")
  
}
#Setting the price as a factor
TA.db$cuisine <- as_factor(TA.db$cuisine)

4.2.4 Number of reviews per restaurant

We add a column with the number of reviews per restaurant, which seemed like an important piece of information to have.

#Creation of the column storing the number of reviews
TA.db <- mutate(TA.db, nbr_of_review = 0)

for(i in 1 : length(TA.db$rate)){
  TA.db$nbr_of_review [[i]]<- length(TA.db$rate[[i]]) 
}

4.2.5 Computation of the mean rating at each successive date

To plot time series we needed the average rating over time. This code provides it by taking the cumulative sum of the ratings up to each date, dividing that sum by the number of reviews up to the same date, and storing the result in a new variable.

#Creation of the column
TA.db$average <-  list(NA)

for(i in 1: length(TA.db$rate)){
  ro <- 1 : length(TA.db$rate[[i]])
TA.db$average [[i]] <- (cumsum(rev(TA.db$rate[[i]])) / ro)
}
#put the running averages back in newest-first order, matching the ratings
for(i in 1: length(TA.db$rate)){
TA.db$average[[i]] <- rev(TA.db$average[[i]])
}
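On a toy rating vector the logic checks out (reviews are stored newest first, hence the two rev calls):

```r
rate <- c(3, 5, 4)                         # newest review first, as scraped
avg  <- rev(cumsum(rev(rate)) / seq_along(rate))

# avg[1] is the mean over all three reviews, avg[3] the oldest review alone
```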

We then saved the result, before nesting the database, so we could go back to a clean scrape if needed.

save(TA.db,file = "clean_TA_db.RDA")

4.3 Data nesting

4.3.1 Nesting of the scraped data

We now realized it would be better to have nested data. So we first artificially create a data frame as if the scrape had produced one line per review, date, and average (repeating the restaurant-level information on each row), and then nest it. (WARNING: this takes a long time to run.)

#Creating the artificial data frame
#Creating the artificial data frame (note: an empty Date column is as.Date(character()); base date() would return the current date-time as a string)
df <- tibble(Name = character(),
                 Price=character(),
                 Page_url=character(),
                 Address = character(),
                 Rate=integer(),
                 Rate_date=as.Date(character()),
                 Nbr_of_review = integer(),
                 Average = double()
                 )

pb1 = txtProgressBar(min = 0, max = length(TA.db$name), style = 3)

for(i in 1:length(TA.db$name)){
k <- length(df$Name)
for(j in 1:length(TA.db$rate[[i]])){
  
  df[j+k,1] <- TA.db$name[[i]]
  df[j+k,2] <- as.character(TA.db$cuisine[[i]])
  df[j+k,3] <- TA.db$page_url[[i]]
  df[j+k,4] <- TA.db$address[[i]]
  df[j+k,5] <- TA.db$rate[[i]][[j]]
  df[j+k,6] <- TA.db$rate.date[[i]][[j]]
  df[j+k,7] <- TA.db$nbr_of_review[[i]]
  df[j+k,8] <- TA.db$average[[i]][[j]]
}
  setTxtProgressBar(pb1,i)
}

## nest the data 
nest_df <- df %>% group_by(Name,Price,Page_url,Address,Nbr_of_review) %>% nest()
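As an aside, tidyr can usually produce the same long format without the slow row-by-row loop; a sketch on toy list-columns (assuming, as in our data, that rate and rate.date have equal lengths in each row):

```r
library(tidyverse)

# Toy stand-in for TA.db with list-columns
TA.toy <- tibble(
  name      = c("A", "B"),
  rate      = list(c(5, 4), 3),
  rate.date = list(as.Date(c("2017-01-01", "2017-02-01")), as.Date("2017-03-01"))
)

# unnest expands to one row per review, which can then be re-nested as above
long <- TA.toy %>% unnest(c(rate, rate.date))
```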

4.3.2 Nesting the hygiene inspections database

We load the hygiene data and nest it, since there are several inspections per restaurant.

control <- read_csv("./data/DOHMH_New_York_City_Restaurant_Inspection_Results .csv")

nest_control <- control %>% group_by(CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,`CUISINE DESCRIPTION`) %>% nest()

4.4 Joining the two databases

4.4.1 Harmonization of the two databases

Now we wanted to join the two data frames. At first we had very few matches, so we proceeded by iteration, harmonizing the data as we went. First we needed the same column names. Then we realized that some names were capitalized and contained odd spacing, so we removed all spaces and put everything in capital letters. We also found problems with the addresses: some said SOUTH STREET while others said just S, and in one database street numbers carried an ordinal suffix (TH, ST, ND, RD) while in the other they did not. In the restaurant names, some words appear in one database but not the other, for instance BAR at the end of a name, so we removed some useless words. We fixed almost all the problems we had, and what we do should be clear from the code. Nevertheless, the problem was never fully solved: there were exceptions for individual restaurants, and it made no sense to write code just for those. Also, some of the restaurants we scraped were simply not in the hygiene database. In the end, out of around 500 we were able to keep about 300 restaurants.

colnames(nest_df)[1] <- "DBA" ##rename col of nest df to the same name as nest control

address <- as.data.frame(str_split_fixed(nest_df$Address," ",2)) ## split the address of nest df in 2 to be like nest control

nest_df <- nest_df %>% mutate(Address = address$V2, BUILDING = address$V1) ## add the 2 "new" columns to nest_df

colnames(nest_df)[4] <- "STREET" ## rename the "new" col that wasn't already right

## loop to transform nest df DBA in capital
for (i in 1: length(nest_df$DBA)){
  
  nest_df$DBA[[i]] <- str_to_upper(nest_df$DBA[[i]])
  
}
## convert the nest_df STREET entries to character (they were factors)
nest_df$STREET <- lapply(nest_df$STREET, as.character)

## loop to transform nest df street in capital
for (i in 1: length(nest_df$STREET)){
  
  nest_df$STREET[[i]] <- str_to_upper(nest_df$STREET[[i]])
  
}

## loop to transform nest control street in capital
for (i in 1: length(nest_control$STREET)){
  
  nest_control$STREET[[i]] <- str_to_upper(nest_control$STREET[[i]])
  
}

## loop to transform nest control DBA in capital
for (i in 1: length(nest_control$DBA)){
  
  nest_control$DBA[[i]] <- str_to_upper(nest_control$DBA[[i]])
  
}

## unlist nest df to be a char vector
nest_df$STREET <- unlist(nest_df$STREET)

## add the ordinal suffix (TH/ST/ND/RD) after street numbers in nest_control STREET, plus other normalizations

for(i in 1:length(nest_control$STREET)){
  
nest_control$STREET [[i]] <- str_replace(nest_control$STREET [[i]],"(?<=[[:digit:]]) ","TH ")##put TH in place of the space after any number
nest_control$STREET [[i]] <- str_replace(nest_control$STREET [[i]],"(?<=1)TH","ST ")##use ST after 1 instead of TH
nest_control$STREET [[i]] <- str_replace(nest_control$STREET [[i]],"(?<=2)TH","ND ")##use ND after 2 instead of TH
nest_control$STREET [[i]] <- str_replace(nest_control$STREET [[i]],"(?<=3)TH","RD ")##use RD after 3 instead of TH
nest_control$STREET [[i]] <- str_replace_all(nest_control$STREET [[i]]," ","")
nest_control$STREET [[i]] <- str_replace_all(nest_control$STREET [[i]],"STREET","ST")
nest_control$STREET [[i]] <- str_replace_all(nest_control$STREET [[i]],"EAST","E")
nest_control$STREET [[i]] <- str_replace_all(nest_control$STREET [[i]],"WEST","W")
nest_control$STREET [[i]] <- str_replace_all(nest_control$STREET [[i]],"SOUTH","S")
nest_control$STREET [[i]] <- str_replace_all(nest_control$STREET [[i]],"NORTH","N")
nest_control$STREET [[i]] <- str_replace_all(nest_control$STREET [[i]],"AVENUE","AVE")
#nest_control$STREET [[i]] <- str_replace_all(nest_control$STREET [[i]],"AVES","AVE")
 }

for(i in 1:length(nest_df$STREET)){

nest_df$STREET [[i]] <- str_replace_all(nest_df$STREET [[i]]," ","")
nest_df$STREET [[i]] <- str_replace_all(nest_df$STREET [[i]],"STREET","ST")
nest_df$STREET [[i]] <- str_replace_all(nest_df$STREET [[i]],"EAST","E")
nest_df$STREET [[i]] <- str_replace_all(nest_df$STREET [[i]],"WEST","W")
nest_df$STREET [[i]] <- str_replace_all(nest_df$STREET [[i]],"SOUTH","S")
nest_df$STREET [[i]] <- str_replace_all(nest_df$STREET [[i]],"NORTH","N")
nest_df$STREET [[i]] <- str_replace_all(nest_df$STREET [[i]],"AVENUE","AVE")
#nest_df$STREET [[i]] <- str_replace_all(nest_df$STREET [[i]],"AVES","AVE")
}

## now we remove useless words from the DBA, for instance BAR, RESTAURANT, etc.

 for(i in 1:length(nest_control$DBA)){
  
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]]," ","")
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]],"&","")
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]],"THE","")
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]],"AND","")
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]],"BAR","")
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]],"RESTAURANT","")
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]],"CAFE","")
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]],"CHEESE","")
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]],"WINE","")
nest_control$DBA [[i]] <- str_replace_all(nest_control$DBA [[i]],"STREET","ST")
 }

 for(i in 1:length(nest_df$DBA)){
  
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]]," ","")
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]],"&","")
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]],"THE","")
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]],"AND","")
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]],"BAR","")
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]],"RESTAURANT","")
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]],"CAFE","")
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]],"CHEESE","")
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]],"WINE","")
nest_df$DBA [[i]] <- str_replace_all(nest_df$DBA [[i]],"STREET","ST")
 }
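A quick sanity check of the street normalization above, here in base R (str_replace's lookbehind behaves the same way; the example street is made up):

```r
street <- "EAST 4 STREET"

street <- sub("(?<=[[:digit:]]) ", "TH ", street, perl = TRUE)  # "EAST 4TH STREET"
street <- gsub(" ", "", street)                                 # drop all spaces
street <- gsub("STREET", "ST", street)
street <- gsub("EAST", "E", street)
```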

4.4.2 Join

Now we can properly join the two data sets with the left_join command:

##left join of the 2 nested data frame
Merged_df <- left_join(nest_df,nest_control,by = c("DBA", "STREET"))
## na.omit of the whole nested data frame
Merged_df <- na.omit(Merged_df)

4.5 Adding the geolocation

We add the geolocation using the Google API; as mentioned before, we had to register for a key and install a development version of ggmap through the devtools package. Then we simply loop over the addresses and get the coordinates:

register_google(key = "YOUR_GOOGLE_API_KEY") ## use your own key; never publish it

for(i in 1 : length(Merged_df$STREET) ){
  Merged_df$location [[i]] <- character()
Merged_df$location [i] <- str_c(Merged_df$BUILDING.x[i], Merged_df$STREET[i],Merged_df$BORO[i], sep = " ") ## BORO must be indexed too, otherwise the whole column is recycled into each entry
}

for(i in 1:length(Merged_df$location)){
Merged_df$location[[i]] <- geocode(Merged_df$location[[i]])
}

4.6 Exporting the database

We can now save our clean, merged database to work on:

Merged_df <- Merged_df[!duplicated(Merged_df$DBA),] ## remove the name duplicates created by left_join

save(Merged_df,file = "Merged_df.RDA")

5 Analysis

5.1 Exploratory analysis

5.1.1 Date interval

We first want to know the time span of our data, so we display the time interval of the scraped database and that of the hygiene inspection database:

d_range_TA <- Merged_df %>%
  unnest(data.x) %>%
  mutate(Rate_date = as_date(as.numeric(Rate_date)))

interval(min(d_range_TA$Rate_date), max(d_range_TA$Rate_date)) %>% kable(col.names = "Rating date interval") %>% kable_styling(position='center')
Rating date interval
2004-06-04 UTC–2018-03-22 UTC
d_range_insp <- Merged_df %>%
  unnest(data.y) %>%
  mutate(`INSPECTION DATE` = mdy(`INSPECTION DATE`)) %>%
  filter(year(`INSPECTION DATE`) > 1910)

interval(min(d_range_insp$`INSPECTION DATE`), max(d_range_insp$`INSPECTION DATE`)) %>% kable(col.names = "Inspection date interval") %>% kable_styling(position='center')
Inspection date interval
2013-04-11 UTC–2017-08-25 UTC

5.1.2 Number of restaurants by price and by type of cuisine

The two main grouping variables we are going to use in our analysis are price and type of cuisine. The prices are divided into 3 categories: cheap, middle price, and expensive. We group the data by price and display the number of observations in each group. We do basically the same for the cuisine description, except that we had too many factor levels, so we first decided to collapse some of them (see code for details).

Merged_df %>% group_by(Price) %>%
  dplyr::count() %>%
  kable( col.names = c("Price range", "Count"))%>% kable_styling(position='center')
Price range Count
€ 48
€€ - €€€ 203
€€€€ 70
Merged_df <- Merged_df %>% mutate(`CUISINE DESCRIPTION` = fct_collapse(`CUISINE DESCRIPTION`,
  Asian = c("Asian", "Chinese", "Japanese", "Korean", "Vietnamese/Cambodian/Malaysia", "Thai"),
  Italian = c("Italian", "Pizza", "Pizza/Italian"),
  European = c("Eastern European", "English", "Greek", "Irish", "Mediterranean", "Portuguese", "Scandinavian", "Spanish"),
  Latin = c("Latin (Cuban, Dominican, Puerto Rican, South & Central American)", "Brazilian", "Tex-Mex", "Mexican", "Hawaiian"),
  American = c("American", "Barbecue", "Hamburgers", "Steak", "Continental"),
  "Middle Eastern" = c("Middle Eastern", "Jewish/Kosher", "Moroccan", "Turkish"),
  "Snack & Other" = c("Bagels/Pretzels", "Café/Coffee/Tea", "Delicatessen", "Donuts", "Soups", "Other", "Salads", "Sandwiches", "Sandwiches/Salads/Mixed Buffet")))

Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>%
  dplyr::count() %>%
  kable( col.names = c("Type of cuisine", "Count"))%>% kable_styling(position='center')
Type of cuisine Count
American 115
Asian 28
Snack & Other 16
Latin 13
European 20
French 27
Indian 3
Italian 75
Middle Eastern 8
Seafood 10
Vegetarian 6

5.1.3 Map of Manhattan with the restaurants

We decided to plot some maps based on the categories mentioned above. The point was to see whether some "regions" of Manhattan were more expensive or contained only restaurants of a particular type of cuisine.

#we noticed that 2 restaurants have a wrong address and are therefore plotted in the wrong place. We fix them by hand.
register_google(key = "YOUR_GOOGLE_API_KEY") ## use your own key; never publish it
Merged_df$location[[171]] <- geocode("15 Main St, Brooklyn, NY 11201, États-Unis")
Merged_df$location[[189]] <- geocode("229 S 4th St, between Havemeyer St & Roebling St, Brooklyn, NY 11211-5605")

# subset the data by price range (kept in a list so we can loop over the groups):
data_filterlist = list( df_lowprice = subset(Merged_df, Price == "€ "),
                 df_mediumprice = subset(Merged_df, Price == "€€ - €€€ "),
                 df_highprice = subset(Merged_df, Price == "€€€€ "))

# The display groups for the layer control, one per price range:
layerlist = c("1 cheap", "2 middle price",
              "3 expensive")

# One marker color per price range:
colorFactors = colorFactor(c('red', 'orange', 'green'), domain=Merged_df$Price)

icons <- list()


icons [[1]] <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = "red")
icons [[2]] <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = "orange")
icons [[3]] <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = "green")



m <- leaflet() %>% addTiles()
# Now the loop: each pass adds one price group's markers to the map object:
for (i in 1:length(data_filterlist)){
  m = addAwesomeMarkers(m, 
                        lng=unname(sapply(data_filterlist[[i]]$location, `[[`, 1)),
                        lat=unname(sapply(data_filterlist[[i]]$location, `[[`, 2)), 
                        label = data_filterlist[[i]]$DBA, 
                        icon = icons[[i]],
                        group = layerlist[i]
              )
}


m = addProviderTiles(m, "Stamen.TonerLite", group = "Toner Lite")


m = addLayersControl(m,
                      overlayGroups = layerlist,
                      baseGroups = c("basic","Toner Lite")
                      )
m

On the map we cannot really see distinct neighborhoods with, for example, more expensive restaurants. If you zoom in very closely you can see some patterns, for instance restaurants are more expensive on 5th Avenue. But maybe we don't have enough data, or Manhattan is simply too big: two streets close to each other can host very different restaurants. This will not really help our analysis, since it will be hard to group the restaurants into relevant neighborhoods and then analyze the effect of hygiene inspections on them. It is still fun to play with the map, since it is interactive, and it lets us see a bit better what we are dealing with; for instance, all our restaurants are concentrated in Manhattan even though they were scraped from the New York page of TripAdvisor. We also wanted to plot the map by type of cuisine, but early versions of such a map led to the same conclusion, so we dropped it.

5.1.4 Barplot of the number of restaurants in each type of cuisine

We produce a bar plot of the number of restaurants by type of cuisine. It clearly shows 4 major categories, which will help us later to decide which graphs to plot.

Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>% dplyr::count() %>% ungroup() %>%
  mutate(`CUISINE DESCRIPTION` = fct_reorder(`CUISINE DESCRIPTION`, n)) %>%
  ggplot(aes(x = `CUISINE DESCRIPTION`, y = n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1), plot.title = element_text(hjust = 0.5)) +
  labs(title = "Number of restaurant in each type of cuisine", x = "Type of cuisine", y = "Count")

5.1.5 Price proportion across type of cuisine

We now plot the price proportions for each type of cuisine. It is interesting to see that among the 4 main categories, French cuisine tends to be more expensive, Asian cheaper, and the other two in the middle. This fits the cliché of what each type of cuisine is like quite well. It will also help us choose which plots to make, and leads to the idea of plotting time series for the 4 main categories with some restaurants from each price range.

Merged_df %>% count(`CUISINE DESCRIPTION`, Price) %>%
  group_by(`CUISINE DESCRIPTION`) %>% mutate(percentage = n / sum(n)) %>% ungroup() %>%
  ggplot(aes(`CUISINE DESCRIPTION`, percentage, fill = Price)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1), plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Price proportion across type of cuisine", x = "Type of cuisine", y = "Proportion")

5.2 Inspection analysis

5.2.1 Number of inspection

We first create a column with the total number of inspections for each restaurant and a column counting the critical ones.

Merged_df <- Merged_df %>%
  mutate(nb_inspection = map_int(data.y, nrow),
         nb_insp_critical = map_int(data.y, ~ sum(str_count(.x$`CRITICAL FLAG`, "^Critical"))),
         critical_ratio = nb_insp_critical / nb_inspection)

We can now display the average number of inspections per restaurant and the average critical ratio (the proportion of controls judged not good enough):

Merged_df %>% summarise(avg_nb_inspection = mean(nb_inspection), avg_crit_ratio = mean(critical_ratio)) %>%
  kable( col.names = c("Average number of inspection", "Average critical ratio"))%>% kable_styling(position='center')
Average number of inspection Average critical ratio
17.04984 0.5449974

We do the same by type of cuisine:

Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>%
  summarise(avg_nb_inspection = mean(nb_inspection), avg_crit_ratio = mean(critical_ratio)) %>%
  kable( col.names = c( "Type of cuisine", "Average number of inspection", "Average critical ratio"))%>% kable_styling(position='center')
Type of cuisine Average number of inspection Average critical ratio
American 17.23478 0.5514914
Asian 21.71429 0.5279837
Snack & Other 17.00000 0.4764363
Latin 16.00000 0.5148083
European 18.00000 0.5603765
French 12.40741 0.5110473
Indian 16.00000 0.5643501
Italian 16.82667 0.5567985
Middle Eastern 20.62500 0.5611786
Seafood 16.10000 0.5812447
Vegetarian 12.00000 0.6105006

We were expecting to see some differences in the critical ratio, but it turns out that no type of cuisine really gets more red flags than the others. This is somewhat logical: it simply means that every restaurant, whatever it cooks, makes hygiene mistakes. We should add that a red flag can be given for small things, because hygiene controls are taken very seriously. As for the non-critical controls, we found that they are often follow-up controls of a critical one. From the table above we can also see that some types of cuisine get more controls: for instance, Asian restaurants are inspected almost twice as often on average as French ones. This supports the idea that people see Asian cuisine as cheaper than French.
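A quick way to check formally whether the critical ratio really differs across cuisine types is a one-way ANOVA; a minimal sketch, assuming `Merged_df` as built above:

```r
## One-way ANOVA of the critical ratio on the cuisine type; a large p-value
## for the `CUISINE DESCRIPTION` term would confirm that no cuisine stands out.
summary(aov(critical_ratio ~ `CUISINE DESCRIPTION`, data = Merged_df))
```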

5.2.2 Plots of the number of inspection

We then plot this information in a boxplot :

Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>%
  ggplot(aes(x = `CUISINE DESCRIPTION`, y = nb_inspection, color = critical_ratio)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1), plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
  labs(title = "Number of inspection by restaurant", subtitle = "Grouped by type of cuisine", x = "Type of cuisine", y = "Number of inspection") +
  scale_color_gradient(low = "green", high = "red")

And a jitter plot of the number of inspections by cuisine, coloured by critical ratio:

Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>%
  ggplot(aes(x = `CUISINE DESCRIPTION`, y = nb_inspection, color = critical_ratio)) +
  geom_jitter() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1),
        plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
  labs(title = "Number of inspection by restaurant", subtitle = "Grouped by type of cuisine", color = "Critical ratio", 
       x = "Type of cuisine", y = "Number of inspection") +
  scale_color_gradient(low = "green", high = "red")

These two plots convey the same information as before, but more clearly. In the jitter plot we can observe the dispersion more precisely. The only pattern/correlation we can find is that a restaurant tested many times will probably have an average critical ratio, while the extreme critical-ratio values belong to restaurants that were not controlled often. This seems logical, since a bad hygiene control triggers a follow-up control, which pulls the critical ratio back towards the average. What is stranger is a restaurant with few controls but a huge critical ratio; what happened there? Probably the controls were flagged critical over a detail while everything else in the kitchen was fine, so only one follow-up control was done. All this is interesting, but it does not really let us conclude anything meaningful for our inferences.

We then fit a linear regression to predict the critical ratio from the number of inspections, the cuisine type and the price. Most coefficients have large p-values, which suggests there is no real relationship in our data, so we will not go further in trying to fit a prediction model.

kable(tidy(lm(Merged_df, formula = critical_ratio ~ nb_inspection + `CUISINE DESCRIPTION` + Price)))%>% kable_styling(position='center')
term estimate std.error statistic p.value
(Intercept) 0.5284424 0.0329110 16.0566949 0.0000000
nb_inspection 0.0025466 0.0009196 2.7691522 0.0059622
CUISINE DESCRIPTIONAsian -0.0347615 0.0313714 -1.1080633 0.2687018
CUISINE DESCRIPTIONSnack & Other -0.0929931 0.0441090 -2.1082572 0.0358189
CUISINE DESCRIPTIONLatin -0.0386181 0.0433280 -0.8912952 0.3734688
CUISINE DESCRIPTIONEuropean 0.0086309 0.0359052 0.2403797 0.8101963
CUISINE DESCRIPTIONFrench -0.0206245 0.0322874 -0.6387796 0.5234426
CUISINE DESCRIPTIONIndian 0.0204248 0.0862554 0.2367941 0.8129745
CUISINE DESCRIPTIONItalian 0.0044639 0.0219888 0.2030083 0.8392630
CUISINE DESCRIPTIONMiddle Eastern -0.0010916 0.0540052 -0.0202129 0.9838867
CUISINE DESCRIPTIONSeafood 0.0384283 0.0487866 0.7876806 0.4314913
CUISINE DESCRIPTIONVegetarian 0.0699440 0.0622480 1.1236350 0.2620458
Price€€ - €€€ -0.0184454 0.0270913 -0.6808590 0.4964739
Price€€€€ -0.0388988 0.0315924 -1.2312708 0.2191640

We then checked whether there was a correlation between the critical ratio and the price category of a restaurant. Once again there is no apparent relationship, as the graph clearly shows:

Merged_df %>% group_by(`CUISINE DESCRIPTION`, Price) %>% 
  ggplot(aes(x = Price, y = critical_ratio, color = Price)) +
  geom_jitter() + 
  facet_wrap(~`CUISINE DESCRIPTION`) +
  geom_point(data = Merged_df %>% 
               group_by(`CUISINE DESCRIPTION`, Price) %>% 
               summarise(avg_crit_ratio = mean(critical_ratio)), 
             aes(x = Price, y = avg_crit_ratio, color = "Mean"))
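As a numeric complement to the graph, a Kruskal-Wallis test can check whether the critical-ratio distributions differ across price categories; a minimal sketch, again assuming `Merged_df` as built above:

```r
## Non-parametric test of the critical ratio across price categories;
## a large p-value would support the "no relationship" reading of the plot.
kruskal.test(critical_ratio ~ as.factor(Price), data = Merged_df)
```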

5.3 Grade analysis

It is now time to plot the time series of individual restaurants. We decided to plot nine restaurants for each of the 4 main categories (three per price range). We only kept the critical controls for these plots, since we want to see whether there is a drop in the TripAdvisor rating after a control judged critical.

5.3.1 Selecting some restaurants

We thus select the restaurants we want; the code speaks for itself:

french <- Merged_df %>%
  filter(`CUISINE DESCRIPTION` == "French") %>% group_by(Price) %>%
  arrange(desc(critical_ratio)) %>%
  filter(row_number() %in% 1:3) %>% 
  ungroup() %>%
  unnest(data.y, .drop = FALSE) %>%
  filter(`CRITICAL FLAG` == "Critical") %>%
  distinct(`INSPECTION DATE`, .keep_all = TRUE)%>%
  select(DBA, Price, `CUISINE DESCRIPTION`, `INSPECTION DATE`) %>%
  nest(`INSPECTION DATE`)
  
french.TA <- Merged_df %>%
  filter(`CUISINE DESCRIPTION` == "French") %>% group_by(Price) %>%
  arrange(desc(critical_ratio)) %>%
  filter(row_number() %in% 1:3) %>% 
  ungroup() %>%
  unnest(data.x, .drop = FALSE)  %>%
  distinct(Rate_date, .keep_all = TRUE)%>%
  select(DBA, Price, `CUISINE DESCRIPTION`, Average,Rate,Rate_date)%>%
  nest(Rate,Rate_date,Average)

french <- french %>% mutate(data.x = french.TA$data)

rm(french.TA)

This was only for the French restaurants, but we did exactly the same for the Italian, Asian and American cuisines.

5.3.2 Pad the rate time series

At this point, we encountered a new problem, which we solved by padding the time series. Obviously TripAdvisor reviews do not arrive on a regular basis, since they are written by users. We therefore use the padr package, which is really handy for cases like this. We first insert a beginning and an ending date, which then become the min and max of the series. We then pad the series, meaning we add a row for each day between those two dates and fill it with the most recent previous rating average that padr can find. This works well and allows us to make the plots afterwards.
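The padding idea can be illustrated on a toy series (hypothetical data, not from our scrape):

```r
library(tibble)
library(padr)   # pad(): insert one row per missing day
library(tidyr)  # fill(): carry the last known value forward

toy <- tibble(Rate_date = as.Date(c("2018-01-01", "2018-01-04")),
              Average   = c(4.0, 4.5))
## pad() inserts rows for Jan 2 and Jan 3; fill() carries the 4.0 forward
toy %>% pad() %>% fill(Average)
```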

## set date columns to Date format (data: inspection dates, data.x: rating dates)
fix_dates <- function(df) {
  df$data   <- map(df$data,   function(d) { d[[1]] <- mdy(d[[1]]); d })
  df$data.x <- map(df$data.x, function(d) { d[[2]] <- as_date(as.numeric(d[[2]])); d })
  df
}

french   <- fix_dates(french)
italian  <- fix_dates(italian)
american <- fix_dates(american)
asian    <- fix_dates(asian)

## now pad: add sentinel start/end dates, insert a row per day, carry the average forward
pad_rates <- function(df) {
  df$data.x <- map(df$data.x, function(d) {
    d %>%
      add_row(Rate_date = as_date("2018-06-01")) %>%
      add_row(Rate_date = as_date("2000-06-01")) %>%
      pad() %>%
      tidyr::fill(Average)
  })
  df
}

french   <- pad_rates(french)
italian  <- pad_rates(italian)
american <- pad_rates(american)
asian    <- pad_rates(asian)

5.3.3 Plot of the french restaurant

We then plot the time series with plotly. We chose that package because it lets us define groups that can be shown or hidden for more clarity. We display the code for the French cuisine plot; the others are similar, except that there are only seven restaurants instead of nine for the French cuisine, since we could not find three cheap ones.

## Hover text and marker y-positions (the rating on the days of critical inspections)
hovertexts <- vector("list", length(french$data.x))
ypos <- vector("list", length(french$data.x))
for (i in seq_along(french$data.x)) {
  hovertexts[[i]] <- paste0("Name:<b>", french$DBA[[i]], "</b><br>",
                            "Date:<b>", french$data.x[[i]][[2]], "</b><br>",
                            "Rating :<b>", french$data.x[[i]][[3]], "</b><br>")
  ypos[[i]] <- french$data.x[[i]][[3]][french$data.x[[i]][[2]] %in% french$data[[i]][[1]]]
}

## One colour and legend group per restaurant, matching its price range
## (red = low, blue = middle, green = high)
line_colors   <- c("#00526d", "green", "green", "#00526d", "green", "#00526d", "red")
legend_groups <- c("groupmiddle", "grouphigh", "grouphigh", "groupmiddle",
                   "grouphigh", "groupmiddle", "grouplow")

p <- plot_ly(french)
for (i in seq_along(french$data.x)) {
  p <- p %>%
    add_lines(x = french$data.x[[i]][[2]],
              y = french$data.x[[i]][[3]],
              line = list(color = line_colors[i], width = 2),
              hoverinfo = "text", text = hovertexts[[i]], name = french$DBA[[i]],
              legendgroup = legend_groups[i],
              showlegend = TRUE) %>%
    add_markers(x = sort(french$data[[i]][[1]]),
                y = ypos[[i]],
                marker = list(size = 10, color = line_colors[i]),
                showlegend = FALSE,
                legendgroup = legend_groups[i])
}

p <- p %>% layout(
  title = "Time series of french cuisine",
  legend = list(x = 100, y = 0.5),
  xaxis = list(
    title = "Date",
    rangeselector = list(
      buttons = list(
        list(count = 5, label = "5 yr", step = "year", stepmode = "backward"),
        list(count = 3, label = "3 yr", step = "year", stepmode = "backward"),
        list(count = 2, label = "2 yr", step = "year", stepmode = "backward"),
        list(count = 1, label = "1 yr", step = "year", stepmode = "backward"),
        list(step = "all"))),
    rangeslider = list(type = "date")),
  yaxis = list(title = "Rating"))
p

5.3.4 Plot of the asian restaurant

5.3.5 Plot of the american restaurant

5.3.6 Plot of the italian restaurant

In the end, observing these plots, one can see that there is no drop in rating after any control, for any type of cuisine or price range. The goal of our project was to determine whether there was a link of any sort between the rating and the hygiene controls, and it appears that there is none. We discuss this in the conclusion.

6 Conclusion

To conclude this report, we will comment on our findings and then state some recommendations and ideas. We found no link between the hygiene controls and the TripAdvisor ratings of the restaurants: our assumptions were proven wrong. The only thing we can say is that ratings on TripAdvisor are based more on personal feeling than on facts about the quality of the kitchen. This is not a bad conclusion, and we could even have anticipated it, because it now seems logical that people are not aware of how and when hygiene controls occur.

We did not use time-series theory to test our assumption: we know how to estimate a seasonal effect in a time series, but only for events repeated on the same date each year, not for one-off events. We clearly lack some knowledge in that field, and it could have been interesting to test the “drop” in rating over our whole database rather than on graphs of a few restaurants. We still think we would have reached the same conclusion, since it now seems logical and since our restaurants were chosen to be representative of the database.
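Such a database-wide test could, for instance, compare the average rating in a window before and after each critical inspection. A rough sketch of a hypothetical helper, assuming a table of dated ratings with numeric `Rate` and `Rate_date` columns as in our `data.x` tibbles:

```r
## Hypothetical helper: mean rating change around one inspection date.
## Returns NA when one of the two windows contains no review.
rating_drop <- function(ratings, insp_date, window = 30) {
  before <- ratings$Rate[ratings$Rate_date >= insp_date - window &
                         ratings$Rate_date <  insp_date]
  after  <- ratings$Rate[ratings$Rate_date >  insp_date &
                         ratings$Rate_date <= insp_date + window]
  if (length(before) == 0 || length(after) == 0) return(NA_real_)
  mean(after) - mean(before)  # negative values would indicate a drop
}
```

Averaging this quantity over all critical inspections in the database would give a single number to test against zero, instead of eyeballing individual plots.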

We think it is a shame that people choose restaurants based on sites that do not take this kind of information into account. The database of hygiene controls is public and updated daily, and we think it should play a role in restaurant ratings. Maybe people should reconsider the way they choose restaurants and, more generally, how they trust information on the internet. It would be great if a platform could integrate this kind of data and thus give a better rating of restaurants. But in the end, it is also important to keep in mind that how you appreciate a restaurant remains a very subjective feeling.

Even if our findings are not really satisfying statistically, we think we made some interesting discoveries and that this project is a success. Moreover, we touched on many subjects related to data science and learned a lot, so we are quite happy with that.